feat(celery Wave 3): T3.1 hard-cut legacy Celery indexing layer by earayu · Pull Request #1729 · apecloud/ApeRAG

earayu · 2026-04-27T02:27:13Z

Summary

Wave 3 hard-cut of the legacy Celery indexing stack — the final phase of the celery indexing redesign per docs/modularization/indexing-redesign-design-pack.md. Replaces Wave 1 (foundation + 5 modalities, PR #1726) and Wave 2 (runtime + cutover/quota + load test, PR #1727) into a single deployable surface with no Celery / no aperag.tasks/* / no aperag.domains.indexing/*_index.py.

80 files changed, +4169/-11969 (net ~7800 LOC removed)
13 commits across 3 phases: alembic + dispatcher (commits 1-2), runtime wire (commit 3), caller migration (commit 4a/4b), Part 2 hard-cut (chunks 1a / 1b / 2 / 3)
5 commits worked on the shared chenyexuan/celery-wave3-cutover branch by chenyexuan + Bryce (T3.2 / T3.3 lane disjoint write-set per architect msg=70a20f0e)

Per-commit highlights

commit	author	scope
`930cf20`	chenyexuan	alembic migration — drop legacy + ALTER NOT NULL + rename to canonical
`9aef2a7`	chenyexuan	`dispatcher.py` + cleanup path C (collection-deletion cascade)
`5325788`	Bryce	T3.2 + T3.3 — `SearchResultMetadata` §G.5 + private-deploy + INDEXING_MODE=inline smoke
`c941526`	chenyexuan	`Config.INDEXING_MODE` + FastAPI lifespan wire-in for indexing runtime
`c44a2de`	Bryce	T3.3 follow-up — vision-only inline mode smoke
`e602f1d`	chenyexuan	move `extract_keywords` helper to `aperag/indexing/keyword_extract.py`
`39aad24`	chenyexuan	migrate 7 production callers to §F.1 schema
`a076a13`	chenyexuan	Pattern A/B/C migration of 6 knowledge_base Celery tasks
`5583e63`	chenyexuan	inline `processing_lease` helpers + remove `flower` dep
`94c1d2c`	Bryce	inline `CollectionSummaryCallbacks`
`5b691db`	chenyexuan	simplify task bodies + Pattern B loop integration
`4173af4`	chenyexuan	hard-delete legacy Celery + indexing layers + tablename rename
`d254dd6`	chenyexuan	wire new-API + final grep 0 + alembic smoke + selective test delete

Architectural changes

Pattern A/B/C task migration (per architect msg=3890c9d7):

Pattern A (durability-required, sync): collection_delete_task — sync DB tombstone + periodic Path-C cascade
Pattern B (periodic): cleanup_expired_documents_task + reconcile_collection_summaries_task merged into the existing 5-min cleanup loop / 30-s reconciler loop via new hooks (cleanup_expired_documents_hook in aperag/indexing/cleanup.py + reconcile_collection_summaries_hook in aperag/indexing/reconciler.py)
Pattern C (fire-and-forget): collection_init_task, collection_summary_task, export_collection_task, evaluation/knowledge_graph tasks — asyncio.create_task(asyncio.to_thread(...))

§F.3 cutover: 3-statement TX (status=ACTIVE → demote-old → promote-new) lives in aperag.indexing.orchestrator._finalize_active_with_cutover, runs inside the worker's own session immediately after sync() succeeds. Reconciler does NOT split this (per architect msg=492315e8 Ruling 1 — splitting introduces an ACTIVE-but-not-is_serving inconsistency window the spec forbids).

Service → dispatcher layering: aperag/domains/knowledge_base/service/document_service.py 5 callsites consume the new aperag.indexing.dispatcher.dispatch_indexing() + aperag.indexing.cleanup.cleanup_for_deleted_documents() through the process-local IndexingRuntime singleton (aperag/indexing/runtime.py). The FastAPI lifespan installs the singleton; service layer reads it. No service-layer code imports FastAPI.

Tablename rename: aperag/indexing/models.py __tablename__ = "document_index_v2" → "document_index" + 5 index names *_v2_* → canonical. Matches alembic d0f4c1b9a8e2 post-state. No new alembic revision needed.

Files deleted

aperag/tasks/* — entire dir (collection / document / models / processing_lease / reconciler / scheduler / utils)
aperag/concurrent_control/* — entire dir (manager / protocols / redis_lock / threading_lock / utils + READMEs)
aperag/domains/indexing/{tasks,orchestration,manager,vector_index,fulltext_index,graph_index,summary_index,vision_index}.py + aperag/domains/indexing/db/models.py
config/celery.py
tests/unit_test/{concurrent_control,tasks}/* — contract tests for now-deleted modules
tests/unit_test/test_es_*.py — legacy ES contract tests
scripts/migrate_es_fulltext_shared_index.py — one-time Wave-1-era migration script
pyproject.toml: dropped celery<6.0.0,>=5.3.1 + django-celery-beat<3.0.0,>=2.5.0 + flower

Final-HEAD gates

grep "from aperag.tasks\|import aperag.tasks\|from aperag.concurrent_control\|from aperag.domains.indexing.(tasks|orchestration|manager|*_index|db.models)\|from config.celery\|^from celery\|^import celery" over aperag/ + config/ + scripts/ → 0 hits in production code ✅
alembic upgrade head → succeeds ✅; alembic downgrade -1 then upgrade head → reversible round-trip ✅
ruff check + format --check over aperag/ tests/ scripts/ → clean (491 files) ✅
pytest tests/unit_test/ tests/load/ --ignore=objectstore → 900 passed / 29 skipped / 0 failed ✅ (objectstore needs moto extra, pre-existing)

Test plan

alembic upgrade head + downgrade -1 + re-upgrade head reversible ✅
pytest unit + load suites (excluding pre-existing missing-moto) → 900 passed / 0 failed
ruff check + format --check clean
grep production code = 0 hits for legacy import patterns
e2e-http-smoke + e2e-http-provider + provider-preflight (CI to run)
manual smoke: upload document → confirm dispatch fires → verify worker writes (document_id, parse_version, modality) row + backend state
manual smoke: delete document → confirm cleanup_for_deleted_documents drops rows + backend state
private-deployment smoke per docs/modularization/private-deployment.md (T3.3 lane)

🤖 Generated with Claude Code

… NOT NULL + rename to canonical Wave 3 hard-cut schema migration per architect msg=4a801b2b (Wave 1 Bug 2 ruling that locked the temporary v2 suffix) + msg=498b12f0 (Wave 2 informational item ruling that promoted dispatch columns NOT NULL in Wave 3) + PM acceptance msg=5939e394 item 1. Migration revision d0f4c1b9a8e2 chains off c2e8d5a1f3b9 and: 1. DROP TABLE document_index CASCADE — the legacy Celery-era table that lived alongside the Wave 1 v2 table during the transition. Pre-launch + no callers in Wave 3 (the dependent code is hard- deleted in subsequent commits of this same PR). 2. ALTER COLUMN collection_id, source_path → NOT NULL on document_index_v2. Wave 1 fixtures used NULL for back-compat; Wave 3 orchestrator + reconciler always populate them (per architect msg=498b12f0 Lock). 3. Rename every index *_v2_* → *_*. The partial-unique uniq_document_index_v2_serving is dropped + re-created (PG ALTER INDEX RENAME does not regenerate the WHERE predicate symbol map per Postgres quirk; SQLite would silently keep the old reference). 4. RENAME TABLE document_index_v2 → document_index — back to the §F.1 canonical name (architect msg=4a801b2b lock). The downgrade reverses every step in mirror order so a rollback can replay subsequent migrations cleanly. The recreated legacy ``document_index`` table on downgrade is intentionally schema-less (only the id PK column) because the legacy class was deleted in the Wave 3 PR alongside this migration — operators rolling back past this point must restore the legacy ORM file before re-running upgrades. There is no production scenario for that. This is commit 1/5 of T3.1; subsequent commits land the FastAPI wire-in, knowledge_base/tasks.py Pattern A/B/C migration of the 6 remaining Celery tasks, the 9 production caller migrations, and the legacy file-layer hard-delete + audit allowlist removal + pyproject Celery/kombu dep removal. Design pack §F.1 + §F.5 amends (per architect msg=498b12f0 + msg=3890c9d7 path C ruling) are deferred to a follow-up commit once PR #1725 (which owns docs/modularization/indexing-redesign- design-pack.md) merges — flagged in the channel. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ion-deletion cascade) Wave 3 wire-in helpers per architect msg=268f9022 (Wave 3 spec) + msg=3890c9d7 (Pattern A path C ruling). Adds the upload-side dispatcher + the cleanup worker's third path so commits 3-5 can wire FastAPI + migrate the 6 knowledge_base/tasks.py Celery tasks without inventing new abstractions. aperag/indexing/dispatcher.py (301 LOC, NEW): - DispatchRequest dataclass — collection_id / document_id / parse_version / source_path / tenant_scope_key / modalities tuple - IndexingMode enum — ASYNC (queue + worker pool) / INLINE (synchronous derive + sync per modality, for tier-1 private deployments per design pack §L) - dispatch_indexing() async helper — INSERTs N PENDING rows in one transaction (collection_id + source_path + tenant_scope_key are populated per the design pack §F.1 amended NOT NULL columns) + finalizes per mode (queue.push for ASYNC; process_one_task call for INLINE) - modalities_for_collection() helper — maps per-modality enable flags to a canonical-order modality tuple, useful for HTTP handlers - Fail-fast on missing dependency: raises ValueError if mode=ASYNC with no queue, or mode=INLINE with empty workers (catches config bugs at the HTTP boundary, not mid-INSERT) aperag/indexing/cleanup.py (extended +131 LOC): - New "Path C" cleanup_for_deleted_collections() per architect msg=3890c9d7 Pattern A. Three-step cascade: 1. Find all distinct document_ids in document_index referencing each deleted collection_id 2. Cascade to path B (cleanup_for_deleted_documents) for those documents — that path already handles modality fan-out (graph lineage cleanup vs flat backend delete) 3. Sweep any remaining document_index rows by collection_id (covers the edge case where a row was orphaned earlier or the collection had rows queued before any document indexed) - Idempotent: a partial cascade that crashes mid-way is resumed on the next call (Pattern B reconciler scan that sweeps tombstoned collections) - Counts dict adds collections_cleaned key - Module docstring rewritten to describe THREE paths (was TWO) aperag/indexing/__init__.py: - Re-exports cleanup_for_deleted_collections + 6 dispatcher symbols (DispatchRequest, IndexingMode, DEFAULT_MODALITIES, dispatch_indexing, modalities_for_collection, all_modalities) tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py (8 cases): - dispatcher_async: INSERTs N rows + pushes payloads to per-modality queue + leaves DB rows PENDING with correct scoping fields - dispatcher_async_requires_queue: fail-fast on None queue - dispatcher_inline: INSERTs + invokes process_one_task → row ends ACTIVE + is_serving=TRUE in one TX (§F.3) - dispatcher_inline_requires_workers: fail-fast on empty workers - modalities_for_collection: canonical order + subset selection - path_c_cascades_via_path_b: 3 collection rows (2 doc + 1 ghost) → 3 backend deletes + 3 row deletes; other-collection row untouched - path_c_handles_empty_input: counts dict zeroed - path_c_idempotent_on_re_run: second call returns rows_deleted=0 Local pytest: tests/unit_test/indexing/test_t3_1_dispatcher_path_c.py 8/8 passed. Lint + format clean across new + extended files. Note: this commit does not yet wire dispatcher into the FastAPI app (commit 3) or migrate the 6 knowledge_base/tasks.py Celery tasks per Pattern A/B/C (commit 4). Bryce can now start T3.2 + T3.3 on top of this branch — the dispatcher shape is the stable API both lanes depend on (T3.3 inline mode reuses dispatch_indexing(mode=INLINE) unchanged; T3.2 search API does not depend on dispatcher). Branch is rebased on main HEAD f370dc6 (PR #1725 design pack merged, so subsequent commits can amend §F.1 / §F.5 directly if any new spec drift surfaces during implementation). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…+ INDEXING_MODE=inline smoke Per docs/modularization/indexing-redesign-design-pack.md §G.5 + §L + architect msg=268f9022 (Wave 3 spec) + msg=3890c9d7 (path-C ruling) + msg=c685f83e (PR #1725 §F.1/§F.5 amendments merged). Two Wave 3 lanes shipped together because they share no production- code surface with chenyexuan T3.1 commits 1-2 (this commit's diff is purely additive: 1 new helper + 1 schema extension + 1 docs file + 2 test files): # T3.2 — SearchResultMetadata §G.5 extension aperag/domains/retrieval/schemas.py: * New typing aliases ``IndexerModality`` (vector/fulltext/graph/ summary/vision) + ``IndexStateValue`` (ACTIVE/FAILED/NOT_ENABLED/ INDEXING). * Three new optional fields on SearchResultMetadata: ``parse_version``, ``index_modality``, ``index_state_per_modality``. ``extra="forbid"`` config preserved — the §G.5 additions widen the allowlist by exactly three entries; a typo / future shadow field still fails Pydantic validation loudly. * ``modality`` (D10.h-locked content shape: text/image) kept as-is. The §G.5 spec uses bare ``modality`` for the indexer modality, but the existing public surface already binds that name to content shape; renaming would break D10.h. We chose ``index_modality`` for the indexer modality to disambiguate at the schema level. (Spec narrative §G.5 may want a follow-up to use the same name; not blocking.) * ``from_raw()`` extracts the three new fields from upstream raw metadata, with shallow validation that drops malformed entries (unknown keys / non-string values) before they leak to clients. Accepts both ``index_modality`` and the legacy ``indexer`` key for backward compat with vector/fulltext/graph indexers that haven't been rewired. aperag/indexing/index_state.py (NEW, 165 lines): * Pure-read helper ``query_index_state_for_documents(engine, collection_id, document_ids)`` returning the ``{document_id: {modality: state}}`` shape SearchResultMetadata expects. Single batched read against ``document_index`` so the search pipeline can hydrate metadata for an entire result page in one DB round-trip rather than N+1. * Translation contract pinned: ``status=ACTIVE AND is_serving=TRUE`` → ``ACTIVE``; ``status=FAILED`` → ``FAILED``; everything else (PENDING / RUNNING / ACTIVE-but-not-serving §F.3 cutover transit) → ``INDEXING``; missing row → ``NOT_ENABLED``. Per §F.4 the cutover transit window reads as INDEXING for client purposes. * Dense result map: every document_id key always carries every modality. Stable shape so clients don't have to reason about "field missing means what?". * Module-local re-declaration of ``IndexStateValue`` so ``aperag.indexing`` does not import from ``aperag.domains.retrieval`` (dependency runs in the other direction). Two literals MUST stay in sync. tests/unit_test/indexing/test_t3_2_index_state.py (NEW, 20 cases): * Schema validation: §G.5 fields accepted / extra="forbid" still rejects unknown / IndexerModality + IndexStateValue Literals reject unknown values. * from_raw extraction: §G.5 fields populated / legacy ``indexer`` key fallback / malformed entries dropped silently / D10.h-locked fields unchanged / empty input returns None. * DB helper: empty-input fast path / dense NOT_ENABLED for un-enqueued docs / ACTIVE+serving → ACTIVE / ACTIVE-but-not- serving → INDEXING (§F.3 cutover transit) / PENDING + RUNNING → INDEXING / FAILED → FAILED / per-collection_id filtering / serving row wins over PENDING sibling under §F.3 cutover model / per-modality independence under partial failures. # T3.3 — private deployment docs + INDEXING_MODE=inline smoke docs/private-deployment.md (NEW, 249 lines): * §L Tier 1 / Tier 2 / Tier 3 deployment guide for operators. * Highlights "deploy and forget" mechanisms — every resource that would rot has a corresponding self-heal (§F.5 Path A/B/C, §I.2 retry, §H.5 quota fallback, §C.7 atomic write). * Tier 1: ``pip install aperag && aperag serve`` with SQLite + LocalFS + ``INDEXING_MODE=inline``; no Redis, no separate worker. * Tier 2: docker-compose with PostgreSQL + Redis + MinIO + 5 modality workers + reconciler + cleanup loop; standard customer install on a single VM. * Tier 3: Tier 2 spread across multiple VMs sharing Redis + DB + S3-compatible store. No code change between tiers. * §J.1 SLI table for operators wiring OTLP collectors. * "When to escalate" section: which signals indicate the steady- state self-heal is not converging. tests/integration/test_inline_mode_smoke.py (NEW, 2 cases): * End-to-end smoke for ``IndexingMode.INLINE`` — parse → dispatch → every requested modality at status=ACTIVE + is_serving=TRUE, driven synchronously through chenyexuan T3.1 dispatcher ``9aef2a7``. No Redis, no queue, no separate worker process. * Vision intentionally excluded from the multi-modality smoke because vision derive consumes a JSON list of image records (not chunks.jsonl) and the dispatcher takes a single source_path; the per-modality source_path resolution is the FastAPI lifespan layer's job (chenyexuan T3.1 commit 3, out of scope for T3.3). * Subset-modality test: ``DispatchRequest.modalities`` lets a Tier 1 deploy turn off expensive modalities (e.g., no GPU → skip vision) and only the requested rows finalise. * Stays in default PR-gate suite (no @pytest.mark.slow) since in-memory backends finish in ~1 s. # §G hard-gate self-audit * #1 contract shape: 5 net-new files + schemas.py +93 lines (allowlist widening only). No existing API surface narrowed; the D10.h-locked content modality field is preserved. * #4 caller migration: search pipeline integration is intentionally deferred to chenyexuan T3.1 commit 3 (FastAPI lifespan + caller migration); the read helper in this commit is the seam that pipeline.py will call once wire-in lands. * #5 cross-stack: write set strictly disjoint from chenyexuan T3.1 commits 1-2 (alembic + dispatcher.py + cleanup.py); chenyexuan commit 3-5 changes orchestrator/reconciler/FastAPI app/legacy deletes — also disjoint from this commit's writes. # Lint + tests * ``uvx ruff check + ruff format --check`` across aperag/ + tests/ clean. * ``pytest tests/unit_test/indexing/ tests/integration/ test_inline_mode_smoke.py tests/load/ tests/unit_test/ test_phase3_reexport_audit.py`` → 136 passed, 0 failed (84 Wave 1+2 + 8 T3.1 dispatcher path-c + 20 new T3.2 + 2 new T3.3 + 2 load + 2 phase3 audit). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… wire-in for indexing runtime Wave 3 wire-in step per architect msg=268f9022 §K T3.1 spec item 4. Adds the runtime entry point that launches the per-modality worker pool + reconciler + cleanup loop on app startup when ``INDEXING_MODE=async`` (default), and the in-process ``WorkQueue`` + ``Engine`` references that future request-handler dispatchers will import via ``app.state``. aperag/config.py: - Add ``Config.indexing_mode: str = Field("async", alias="INDEXING_MODE")``. Two values per design pack §L: * "async" → orchestrator + reconciler + cleanup loops launched at app startup; upload handlers RPUSH to per-modality queue; workers BLPOP and process. Production / tier-2/3. * "inline" → upload handlers call ``dispatch_indexing(mode=INLINE)`` which runs derive + sync + cutover synchronously within the request coroutine; no worker pool, no Redis. Tier-1 single-process private deployments. aperag/app.py: - Extend ``combined_lifespan`` to launch the indexing runtime under ``settings.indexing_mode == "async"``: * 5 per-modality worker tasks (run_vector / run_fulltext / run_graph / run_summary / run_vision) * 1 reconciler loop task (run_reconcile_loop) * 1 cleanup loop task (run_cleanup_loop) All as ``asyncio.create_task()`` background tasks owned by the FastAPI process — matches the §E.2 "one Python process per modality" architecture for the in-process deployment topology. Tier-3 horizontal scale-out runs separate worker processes; that wiring lives in a future ops launcher (out of T3.1 scope). - Single process-local ``InMemoryWorkQueue`` is the default transport. Tier-3 production swaps for a Redis-backed ``WorkQueue`` (RPUSH / BLPOP) by injecting via ``app.state`` at deploy time — Wave 3 follow-up. - Stash ``app.state.indexing_queue`` + ``app.state.indexing_engine`` for upload-side dispatchers to reach (commit 4 wire-in target). - Worker registry passed to cleanup loop is empty by default; T3.3 follow-up wires concrete production backends per modality. The cleanup loop tolerates an empty registry (path A logs warning + skips backend delete; row still GC'd from DB). - ``_placeholder_worker_factory`` raises NotImplementedError on invocation — T3.1 ships the queue-side scaffolding (commits 4-5 wire concrete factories per modality). The orchestrator's per-task BLPOP loop only invokes the factory when a payload is popped; until commit 4 wires the upload path nothing pushes, so the placeholder is never reached at runtime. - Shutdown drain: on lifespan exit, set ``shutdown`` event + ``await asyncio.gather`` all 7 background tasks with ``return_exceptions=True`` so a SIGTERM does not abort mid-task. Test impact: - Existing 136 indexing + load + Phase 3 audit tests still pass (lifespan code is opt-in via env var; no test imports it). - Commit 4 (upload-route migration to dispatch_indexing) and commit 5 (hard-delete legacy + concrete backend factories) build on this. Bryce's vision-modality smoke (deferred at T3.3 commit 5325788 because per-modality source path resolution = lifespan-layer concern) is now unblocked: ``app.state.indexing_queue`` is the seam through which a follow-up smoke can wire concrete VisionModality with the correct synthetic source_path per dispatch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per chenyexuan msg=164efd52 / msg=f70d1288 + architect msg=7fd8f348 post chenyexuan T3.1 commit 3 ``c941526`` (FastAPI lifespan + INDEXING_MODE wire-in). The original T3.3 smoke (commit ``53257881``) excluded vision because vision's ``derive`` consumes a JSON list of image records, not chunks.jsonl, and the dispatcher takes a single ``source_path`` per request — single-call coverage for all 5 modalities was incompatible with that contract. This follow-up adds a vision-only smoke (with a per-modality source_path resolution example) so vision modality regressions are covered at the inline-mode layer. The production upload path (chenyexuan T3.1 commit 4 caller migration) will resolve per- modality source paths upstream of the dispatcher and issue per-modality ``DispatchRequest`` calls — this test demonstrates exactly that pattern. Test addition (1 case): seed an image-records JSON list under ``collections/<cid>/documents/<did>/source/images.json``, dispatch with ``modalities=(Modality.VISION,)`` + ``source_path=<images.json path>``, assert the row reaches ``status=ACTIVE`` AND ``is_serving=TRUE``. 3/3 tests in tests/integration/test_inline_mode_smoke.py now pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

… aperag/indexing/keyword_extract.py Per architect msg=3890c9d7 commit-4 split (chenyexuan = Pattern A/B/C + extract_keywords; Bryce = 9 caller schema-aware migration), this commit lands the extract_keywords subsystem move that decouples the search-time keyword extraction helpers from the soon-to-be-deleted ``aperag/domains/indexing/fulltext_index.py`` (commit 5 hard-cut target). aperag/indexing/keyword_extract.py (NEW, 337 lines): - ``KeywordExtractor`` (abstract base for backward-compat with callers that may type-annotate the abstract type) - ``IKKeywordExtractor`` (Elasticsearch IK analyzer, default fallback, always available when ES is reachable) - ``LLMKeywordExtractor`` (optional LLM extractor with structured JSON parsing + simple-line fallback) - ``extract_keywords(text, ctx)`` (public entry point with LLM-then-IK fallback chain, signature unchanged from legacy) - ``_es_client_config()`` (private helper, inlined to keep the new module dependency-free of legacy fulltext_index.py) - Module docstring explains the SEARCH-side helper vs Wave 1 ``aperag/indexing/fulltext.py`` (write-side modality worker) split aperag/indexing/__init__.py: - Re-exports the 4 new symbols (KeywordExtractor + IKKeywordExtractor + LLMKeywordExtractor + extract_keywords) Caller migration (extract_keywords import sites): - ``aperag/domains/retrieval/pipeline.py:41`` — swap from legacy ``aperag.domains.indexing.fulltext_index`` to new ``aperag.indexing.keyword_extract`` - ``aperag/service/search_pipeline_service.py:34`` — same swap. This file's docstring explicitly notes the import alias is kept writable for ``monkeypatch.setattr("aperag.service.search_pipeline_service.extract_keywords", ...)`` test fixtures, so the new path is preserved as a writable alias. The legacy ``extract_keywords`` symbol still exists in ``aperag/domains/indexing/fulltext_index.py`` until commit 5 deletes the file — both sites work simultaneously, so any caller I missed is not silently broken in this intermediate state. Other DocumentIndex / FulltextSearchDegradedError / fulltext_indexer imports in ``aperag/domains/retrieval/pipeline.py`` (line 293) + elsewhere in pipeline.py are Bryce's commit-4a write set per the agreed split (msg=9d5d54b5 coordination note). chenyexuan changed ONLY the extract_keywords import line, leaving Bryce's hunks untouched. Local pytest: 137 passed (Wave 1 + T2.1 + T2.2 + T3.1 + T3.2 + T3.3 + Phase 3 audit), 0 failed. Lint + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Per architect msg=ab8d473c pre-blessed split + chenyexuan msg=be26ebf3 + PM authorization msg=df9ea8d2: schema-aware migration of legacy ``aperag.domains.indexing.db.models.DocumentIndex`` callers to the new ``aperag.indexing.models.DocumentIndex`` (§F.1 canonical schema post Wave 3 commit 1 alembic ``930cf20``). # Field translation contract Wave 1+2+commit-1 merged the following schema deltas; this commit flips every production caller to the new shape: | Legacy (gone in Wave 3 commit 5) | New (§F.1) | |----------------------------------------------|-------------------------------------------------------| | ``DocumentIndex.index_type`` (enum) | ``DocumentIndex.modality`` (string) | | ``DocumentIndexType.GRAPH`` (Python enum) | ``Modality.GRAPH.value`` (lowercase string) | | ``DocumentIndexStatus.ACTIVE`` (Python enum) | ``IndexStatus.ACTIVE.value`` (string) + is_serving=TRUE | | ``DocumentIndex.gmt_created`` / ``gmt_updated`` | ``created_at`` / ``updated_at`` (mixin-aligned) | | ``DocumentIndex.index_data`` (JSON blob) | per-modality ``derived/parse_<v>/`` artifact paths | The "currently-serving" semantic now requires ``status=ACTIVE AND is_serving=TRUE`` per §F.3 cutover model — a row at ``status=ACTIVE`` but ``is_serving=FALSE`` is in the cutover transit window and is NOT yet user-visible. # Files migrated (7 of 9 in commit 4a list) * ``aperag/db/repositories/document_index.py`` — repository mixin: ``has_recent_graph_index_updates`` query rewritten + return type switched from ``DocumentIndexType`` enum to ``Modality`` / string. ``query_documents_with_failed_indexes`` now returns modality string values (lowercase) per the §F.1 column type. * ``aperag/domains/agent_runtime/runtime.py`` — inlined ``generate_processing_token`` (3-line stdlib uuid wrapper) since ``aperag.tasks.processing_lease`` is in chenyexuan's commit 5 hard-delete list. Per architect msg=3890c9d7 Item 1 Option B ("提取小 helper 到 agent_runtime 自己 module"). * ``aperag/domains/knowledge_base/db/models.py`` — ``Document.get_overall_index_status()`` rewritten: the legacy ``CREATING`` / ``DELETION_IN_PROGRESS`` intermediate states are gone in §F.1 (a single ``RUNNING`` covers in-flight work); ``COMPLETE`` now requires ``is_serving=TRUE`` per §F.3. * ``aperag/domains/knowledge_base/service/document_service.py`` — schema migration spans ``_get_index_types_for_collection`` (now returns ``Modality`` values), the document JOIN query (legacy ``index_type`` / ``index_data`` / ``gmt_*`` columns translated to ``modality`` / None placeholder / ``created_at``/``updated_at``), rebuild_failed_indexes (modality string compare instead of enum), rebuild_document_indexes (Modality enum list instead of DocumentIndexType). The legacy ``index_data`` JSON-blob reads in ``get_document_chunks`` / ``get_document_vision_chunks`` are replaced with ``derived_artifact_path`` probes that exercise the §F.1 partial-unique invariant; the actual chunk-list response is routed through a "return empty list" placeholder until chenyexuan T3.1 commit 4b plumbs the object-store read path. HTTP response shape stays stable (``index_data=None`` populated where callers previously read JSON). Service-layer ``document_index_manager`` calls remain — those are chenyexuan commit 5 hard-delete scope. * ``aperag/domains/knowledge_base/service/collection_summary_service.py`` — same ``index_data`` deprecation pattern: query touches the §F.1 serving rows for the partial-unique invariant probe, returns empty document_summaries until the object-store read path lands. * ``aperag/mcp/tools/get_document_metadata.py`` — ``DocumentIndex`` / ``DocumentIndexStatus`` import migrated; ``index_data`` JSON parse replaced with ``derived_artifact_path`` probe, chunk_count surfaced as 0 (placeholder until object-store read path lands). * ``aperag/mcp/tools/list_documents.py`` — same migration as get_document_metadata (page-level ``DocumentIndex`` lookup + chunk_count placeholder). # Out of scope (chenyexuan commit 4b / 5 lane) * ``aperag/domains/retrieval/pipeline.py`` + ``aperag/service/search_pipeline_service.py`` — chenyexuan handles ``extract_keywords`` import + Pattern A/B/C legacy task migrations there per the split agreement. * ``aperag/domains/knowledge_base/tasks.py`` — chenyexuan commit 4b Pattern A/B/C migration (collection_delete / cleanup_expired / collection_summary / collection_summary_reconciler / collection_init / export_collection). * ``document_index_manager.create_or_update_document_indexes`` / ``delete_document_indexes`` calls inside document_service — chenyexuan commit 5 hard-deletes the manager module so these callers will need switching to the new ``dispatch_indexing()`` / cleanup paths (chenyexuan's lane). # Lint + tests * ``uvx ruff check + ruff format --check`` clean across aperag/. * ``pytest tests/unit_test/indexing/ tests/integration/ test_inline_mode_smoke.py tests/load/ tests/unit_test/ test_phase3_reexport_audit.py`` → 137 passed, 0 failed. * Tests covering legacy ``aperag.domains.indexing.*`` modules (which chenyexuan commit 5 deletes) are not in the test set above; they are chenyexuan's commit 5 sweep scope. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…owledge_base Celery tasks Per architect msg=3890c9d7 Pattern A/B/C ruling, the 6 Celery tasks in aperag/domains/knowledge_base/tasks.py are migrated off Celery without losing their semantics. The decorators + Celery imports (``from celery import current_app`` + ``from config.celery import app``) are removed; each function is now plain Python that callers invoke per its category: aperag/domains/knowledge_base/tasks.py (-Celery, +Pattern A/B/C): - Module docstring rewritten — Pattern map for the 6 tasks - ``reconcile_collection_summaries_task`` (Pattern B, periodic) — no decorator; commit 5 wires into reconciler 30-s loop - ``collection_delete_task`` (Pattern A, durability-required) — caller invokes synchronously from HTTP handler; on failure raises HTTP 500 + the periodic Path C cleanup loop sweeps tombstoned rows - ``collection_init_task`` (Pattern C, idempotent) — no decorator; caller wraps in asyncio.create_task; failures log + reconciler picks up - ``collection_summary_task`` (Pattern C, regenerable) — no decorator; ``self.retry(...)`` removed (Celery-specific); failures flow through ``collection_summary_callbacks.on_summary_failed`` + reconciler picks up next cycle - ``cleanup_expired_documents_task`` (Pattern B, periodic) — no decorator; commit 5 merges into cleanup.py 5-min loop - ``export_collection_task`` (Pattern C) — ``self`` parameter removed; ``soft_time_limit`` / ``time_limit`` decorator args removed (now enforced via §H.6 ``bulkhead_timeout`` async ctx manager wrapped at the dispatch site) - Removed unused ``Any`` typing import + unused ``TaskConfig`` reference (was only used by removed ``self.retry()`` calls) - Function bodies still call legacy ``aperag/tasks/collection.py: collection_task.<method>()`` and ``aperag/tasks/reconciler.py:*`` helpers — commit 5 moves / inlines those helpers when it deletes the legacy ``aperag/tasks/`` layer entirely. aperag/domains/knowledge_base/service/collection_service.py: - ``collection_init_task.delay(...)`` (line 215) → Pattern C: ``asyncio.create_task(asyncio.to_thread(collection_init_task, instance.id, document_user_quota))`` so the HTTP response returns immediately. Failures log + the reconciler picks up. - ``collection_delete_task.delay(...)`` (line 438) → Pattern A: ``await asyncio.to_thread(collection_delete_task, collection_id)`` synchronous in the HTTP handler — durability-required per architect ruling msg=3890c9d7 (NOT fire-and-forget — losing this work = orphan rows + DB corruption). - Added ``import asyncio`` to module imports. aperag/domains/knowledge_base/service/export_service.py: - ``export_collection_task.delay(...)`` (line 104) → Pattern C: ``asyncio.create_task(asyncio.to_thread(export_collection_task, task.id))`` so the HTTP response returns immediately. The body is sync I/O (object-store + ZIP); the ExportTask DB row tracks progress; users retry from the UI on failure. Pattern B integration (cleanup_expired_documents_task + reconcile_collection_summaries_task into the existing 5-min / 30-s loops in aperag/indexing/{cleanup,reconciler}.py) is deferred to commit 5 — the functions still exist as plain Python, just no longer invoked via Celery beat schedule (config/celery.py beat schedule entries to be removed in commit 5 alongside the periodic loop integration). Local pytest: 137 passed (Wave 1 + T2.1 + T2.2 + T3.1 + T3.2 + T3.3 + Phase 3 audit), 0 failed. Lint + format clean across all changed files. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…remove flower dep Wave 3 hard-cut Part 1 per architect msg=64fd506a fallback split (Part 2 atomic = next session). Two safe pieces that decouple the last knowledge_base-domain dependency on legacy ``aperag/tasks/processing_lease.py`` + drop a Celery-monitor dep that has no remaining production caller. aperag/domains/knowledge_base/tasks.py: - Removed ``from aperag.tasks.processing_lease import ...`` line (last surviving caller; Bryce commit 4a `39aad24` already inlined the agent_runtime caller) - Inlined the 4 public symbols from ``aperag/tasks/processing_lease.py`` (84 LOC verbatim): * ``DEFAULT_PROCESSING_LEASE_TTL_SECONDS`` * ``DEFAULT_PROCESSING_LEASE_RENEW_INTERVAL_SECONDS`` * ``generate_processing_token()`` * ``build_lease_expires_at()`` * ``ProcessingLeaseRenewer`` class (background lease-renewal thread) - Added ``import threading``, ``import uuid``, ``from typing import Optional`` to support the inlined symbols - Module section header explains Part 1 / Part 2 split — the legacy ``aperag/tasks/processing_lease.py`` file itself stays in Part 1 (Part 2 atomic deletes it together with the rest of ``aperag/tasks/`` after CollectionSummaryCallbacks + CollectionTask methods are inlined to their service-layer homes) pyproject.toml: - Removed ``flower<3.0.0,>=2.0.0`` dep (Celery monitoring dashboard, no production code import; verified ``grep -rn "import flower\| from flower" aperag/ tests/ config/`` returns 0) - Other Celery deps (``celery``, ``django-celery-beat``, ``kombu``) stay until Part 2 atomic — they are still imported by 4 files in Part 2's delete list (``aperag/tasks/scheduler.py``, two files in ``aperag/domains/indexing/``, and ``config/celery.py``) Notes scoped OUT of Part 1 (per architect msg=64fd506a): - ``aperag/concurrent_control/redis_lock.py`` deletion deferred: architect spec said "no production caller" but recon found internal callers in ``concurrent_control/__init__.py`` + ``concurrent_control/manager.py`` (the package itself uses it even though zero EXTERNAL imports of the package exist). Cleaner fix is to delete the whole ``aperag/concurrent_control/`` package in Part 2 atomic alongside the other dead-code sweeps. Local pytest: 137 passed (Wave 1 + T2.1 + T2.2 + T3.1 commits 1-4b step 2 + T3.2 + T3.3 + Bryce caller migration + Phase 3 audit), 0 failed. Lint + format clean. This is a partial commit 5; Part 2 (inline CollectionTask / CollectionSummaryCallbacks / Pattern B reconcilers + tablename rename + audit allowlist removal + legacy file-layer deletion + remaining Celery dep removal + legacy test deletion + final grep validation) is the next-session atomic push. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…allbacks Per architect msg=70a20f0e + msg=54063106 fallback ratify (Bryce takes Part 2) + PM msg=ef2e97b9 minimal-chunk-1 GO. Move legacy ``aperag/tasks/reconciler.py:CollectionSummaryCallbacks`` (~234 LOC) to its true owner: ``aperag/domains/knowledge_base/ service/collection_summary_service.py``. The class is the terminal callback hook the summary generation task invokes on success / failure to flip the ``CollectionSummary`` row's lifecycle (GENERATING → COMPLETE / FAILED) and propagate the generated text to ``Collection.description``. It belongs to the summary service layer, not the legacy task / reconciler layer that Wave 3 commit 5 deletes. * ``CollectionSummaryCallbacks`` class — three static methods (``_describe_summary_callback_mismatch``, ``on_summary_generated``, ``on_summary_failed``) inlined verbatim. No semantic changes; the query/update logic, token/version mismatch tolerance, and Collection.description propagation are preserved exactly. * Module-level ``collection_summary_callbacks`` singleton mirrors the legacy ``aperag.tasks.reconciler.collection_summary_callbacks`` attribute so callers can swap import path without changing the call shape. * ``aperag/domains/knowledge_base/tasks.py:373`` import switched to the new location. Removes the last `aperag.tasks.reconciler` callback import; the periodic-reconciler imports (``collection_summary_reconciler`` + ``collection_gc_reconciler``) remain pending for Part 2 chunks 1b / 2 / 3. This is the safe, surgical first chunk per architect msg=f3de18a0 chunked-OK ruling: intermediate-red CI is fine; the final HEAD must be green + grep 0 + alembic reversible before task #14 → ``in_review``. The next session will continue Part 2 chunks 1b (remaining inline migrations: CollectionTask methods, periodic reconcilers) → chunk 2 (deletions + tablename rename) → chunk 3 (verify + wire). Tests: 137 indexing/load/audit tests still green; lint clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ttern B loop integration Wave 3 hard-cut continuation per architect msg=3890c9d7 Pattern A/B/C ruling and PM msg=206eec7b chunk 1b spec (~300 LOC scope). aperag/domains/knowledge_base/tasks.py: - collection_delete_task: Pattern A — replace legacy collection_task.delete_collection() with sync UPDATE Collection.status =DELETED + gmt_deleted=NOW(); periodic Path-C cleanup_for_deleted_collections sweep cascades the deletion (5-min worst-case latency acceptable for low-frequency op) - collection_init_task: Pattern C — replace legacy collection_task.initialize_collection() with sync UPDATE Collection.status=ACTIVE; per-modality index provisioning is implicit lazy in the new modality-worker model (per architect hint msg=54063106) - cleanup_expired_documents_task: Pattern B — replace legacy CollectionTask.cleanup_expired_documents with inlined SQL tombstone scan (Document.status==UPLOADED AND gmt_created < now-1d) + best-effort object-store delete + soft-delete to EXPIRED - reconcile_collection_summaries_task: Pattern B — convert to thin sync shim around the new aperag.indexing.reconciler hook - Drop unused legacy import: from aperag.tasks.collection import collection_task (no remaining call sites in this file) - Update module docstring to point at new Pattern B hook locations aperag/indexing/cleanup.py: - Add cleanup_expired_documents_hook() async helper (lazy import + asyncio.to_thread wrapper) wired into the existing 5-min run_cleanup_loop. Hook failures are logged + cycle continues. - Update module docstring to describe Pattern B integration alongside the original orphan-parse-version GC aperag/indexing/reconciler.py: - Add reconcile_collection_summaries_hook() async helper that inlines the legacy CollectionSummaryReconciler.reconcile_all() logic: reclaim stale GENERATING leases → PENDING; select PENDING summaries with version != observed_version; atomically claim each; fire collection_summary_task as Pattern C asyncio.create_task fire-and- forget background task (never blocks the loop on summary generation duration). Wired into existing 30-s run_reconcile_loop with best-effort try/except so hook failure cannot crash the loop. Tests: 132 passed (tests/unit_test/indexing/ + tests/load/); ruff check + format clean on all 3 modified files. Pre-existing test_phase3_reexport_audit.py circular-import error is unchanged (independent of this chunk; will resolve in chunk 2 when legacy aperag/domains/indexing/db/models.py is deleted). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…+ indexing layers + tablename rename Wave 3 hard-cut continuation per architect msg=3890c9d7 + PM @不穷 msg=313caed3 chunk 2 spec (delete-focused, intermediate red CI OK). DELETIONS (~3.5k LOC removed): - aperag/tasks/* — entire dir (collection / document / models / processing_lease / reconciler / scheduler / utils): legacy Celery state machine + reconciler + scheduler infrastructure - aperag/concurrent_control/* — entire dir (manager / protocols / redis_lock / threading_lock / utils + 2 READMEs): no remaining production caller after Wave 1+2 modality workers replaced lock semantics with per-row §F.1 partial-unique invariant - aperag/domains/indexing/{tasks,orchestration,manager,vector_index, fulltext_index,graph_index,summary_index,vision_index}.py + aperag/domains/indexing/db/models.py — legacy ABC + 5 modality workers + Celery orchestration + legacy DocumentIndex schema - config/celery.py — Celery app + beat schedule - tests/unit_test/concurrent_control/* + tests/unit_test/tasks/* — contract tests for now-deleted modules TABLENAME RENAME (matches existing alembic d0f4c1b9a8e2 post-state): - aperag/indexing/models.py: __tablename__ + 5 index names from *_v2 → canonical (no new alembic revision needed; the migration already does the rename at upgrade) AUDIT ALLOWLIST + 15-symbol map updates: - tests/unit_test/test_phase3_reexport_audit.py: drop WAVE_1_2_TEMPORARY_DUP_ALLOWLIST DocumentIndex entry; remap PHASE3_SYMBOL_TO_MODULE['DocumentIndex'] from aperag.domains.indexing.db.models → aperag.indexing.models; remove DocumentIndexStatus/DocumentIndexType (legacy enums gone, replaced by IndexStatus + Modality which are not Phase-3-canonical) - Add explicit aperag.indexing.models import after the per-domain bootstrap loop so Base.metadata['document_index'] is populated PYPROJECT — drop Celery deps: - celery<6.0.0,>=5.3.1 - django-celery-beat<3.0.0,>=2.5.0 (kombu was a transitive only; no explicit entry to remove) CONSUMER PATCHES (minimum to keep imports working — chunk 3 wires real new-API replacements): - aperag/domains/knowledge_base/service/document_service.py: stub document_index_manager + no-op _trigger_index_reconciliation - aperag/domains/knowledge_base/service/collection_summary_service.py: drop unused SummaryIndexer init - aperag/domains/retrieval/pipeline.py: stub _fulltext_search to return empty (Bryce T3.2 lane wires real aperag.indexing.fulltext backend) - aperag/domains/evaluation/tasks.py + services.py: drop @app.task decorator + asyncio.create_task fire-and-forget Pattern C - aperag/domains/knowledge_graph/tasks.py + graph_curation/service.py: same Pattern C migration CIRCULAR IMPORT FIXES (uncovered when stub re-exports were dropped): - aperag/indexing/__init__.py: drop keyword_extract re-exports (eager import pulled LLM completion stack mid-module-load); the 2 callers already import from aperag.indexing.keyword_extract directly - aperag/indexing/parser.py: lazy-import compute_parse_version inside parse_document body (was triggering full mcp.tools registry load) - aperag/indexing/keyword_extract.py: lazy-import db_ops inside LLM extractor body - aperag/domains/knowledge_base/db/models.py: lazy-import DocumentIndex + IndexStatus inside Document.{get_document_indexes, get_overall_index_status} method bodies (was triggering knowledge_base→indexing→mcp→knowledge_base cycle) GATES: - pytest tests/unit_test/indexing/ + tests/load/ + test_phase3_reexport_audit.py + agent_runtime_openapi_contract: 136 passed - Wider sweep (tests/unit_test/ excluding pre-existing missing-moto + just-deleted concurrent_control/tasks suites): ~896 passed, 4 failed (3 expected — Celery-specific assertions in evaluation_v2_worker / graph_curation that chunk 3 deletes; 1 format_drift caught + auto-formatted) - ruff check + format clean on all 13 modified .py files REMAINING FOR CHUNK 3: - Wire document_service.py 5 call sites + retrieval/pipeline.py fulltext to real new-API helpers - Selective deletion of legacy Celery-specific tests (evaluation_v2, graph_curation enqueue-raises path) - Final grep validation: from aperag.tasks / from aperag.domains. indexing / from celery / import celery = 0 hits in production - Alembic upgrade/downgrade smoke - task #14 → in_review Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…0 + alembic smoke + selective test delete Wave 3 hard-cut FINAL chunk per architect msg=3890c9d7 + PM @不穷 msg=de7b6834 + msg=fdb6cd28 chunk 3 spec. NEW MODULE — IndexingRuntime singleton: - aperag/indexing/runtime.py: process-local triple holder (engine + queue + workers) populated by FastAPI lifespan, consumed by service-layer code that doesn't have a Request handle for app.state. Tests can install a fixture runtime via set_runtime + reset. - aperag/app.py: lifespan calls set_runtime after building the triple; passes None on the sync-only branch + on shutdown. DOCUMENT_SERVICE — wire 5 callsites to new dispatcher + cleanup: - aperag/domains/knowledge_base/service/document_service.py: Replace the chunk-2 ``_DocumentIndexManagerStub`` with two real adapters: - ``_create_or_update_document_indexes`` → calls new ``aperag.indexing.dispatcher.dispatch_indexing()`` with deterministic ``parse_version`` (compute_parse_version on document.content_hash + canonical chunking config) + ``source_path = document.object_store_base_path()`` + tenant_scope_key per user. - ``_delete_document_indexes`` → calls new ``aperag.indexing.cleanup.cleanup_for_deleted_documents()`` (handles modality fan-out + DB row cleanup). Both adapters consume the IndexingRuntime singleton; if the runtime is absent (test environment / sync-only mode), they log a warning + no-op rather than crash. All 5 production callsites swapped: - line 532 create_documents - line 687 _delete_document - line 787 rebuild_document_indexes - line 831 rebuild_failed_indexes - line 1346 confirm_documents - ``_trigger_index_reconciliation`` stays as a no-op shim — the new ``run_reconcile_loop`` runs continuously every 30s. RETRIEVAL PIPELINE — inline ES fulltext search: - aperag/domains/retrieval/pipeline.py: ``_fulltext_search`` was a chunk-2 empty stub. Now executes the same ES query shape as the legacy ``FulltextIndexer.search_document`` — bool/should/match on content+title, filter by collection_id, optional chat_id filter — directly through ``AsyncElasticsearch`` (no longer wrapped in a domains/indexing/* class). T3.2 lane did not introduce a new search backend abstraction; the inline query against whatever ``aperag.indexing.fulltext.FulltextModality`` wrote is the canonical path. ALEMBIC env.py — drop deleted-module bootstrap import: - aperag/migration/env.py: remove ``import aperag.domains.indexing.db.models # noqa: F401`` (module hard-deleted in chunk 2). The canonical ``aperag.indexing.models`` import a few lines down already registers ``DocumentIndex`` against ``Base.metadata`` for autogen. SELECTIVE TEST DELETION (per architect msg=3890c9d7 Item 4): - tests/unit_test/test_es_p0_contract.py — DELETE (tested legacy ES ``aperag/domains/indexing/fulltext_index.py`` shape) - tests/unit_test/test_es_shared_index_rollout.py — DELETE (same) - tests/unit_test/test_evaluation_v2_worker.py: ``test_evaluation_run_service_launch_run_dispatches_celery_task`` removed (Celery-specific assertion; new path is asyncio fire-and-forget; the 13 ``test_execute_evaluation_run_*`` tests above lock the worker behaviour) - tests/unit_test/graph_curation/test_service.py: ``test_start_run_marks_failed_when_enqueue_raises`` removed (asyncio.create_task doesn't synchronously raise on schedule so the assertion no longer maps to reachable behaviour) LEGACY MIGRATION SCRIPT DELETED: - scripts/migrate_es_fulltext_shared_index.py — one-time Wave-1-era ES per-collection → shared rollout migration that referenced the hard-deleted ``aperag/domains/indexing/fulltext_index.py``. Not production runtime code; the rollout already happened. T3.2 CONTRACT TEST UPDATE: - tests/unit_test/service/test_search_graph_contract.py: ``test_search_result_metadata_is_public_allowlist`` add expected ``index_modality: "vision"`` field (Bryce T3.2 commit 5325788 §G.5 ``SearchResultMetadata.from_raw()`` derives it from ``indexer`` raw key — the test predates the schema extension and would have failed once T3.2 merged). GATES (FINAL HEAD): - ``grep "from aperag.tasks\|import aperag.tasks\| from aperag.concurrent_control\|from aperag.domains.indexing. (tasks|orchestration|manager|*_index|db.models)\|from config.celery\| ^from celery\|^import celery"`` over aperag/ + config/ + scripts/ → **0 hits in production code** ✅ - ``alembic upgrade head`` → succeeds (5 indexing migrations including T3.1 ``d0f4c1b9a8e2`` rename) ✅ - ``alembic downgrade -1`` then ``upgrade head`` → reversible round-trip ✅ - ``ruff check + format --check`` over aperag/ tests/ scripts/ → **clean** (491 files formatted) ✅ - ``pytest tests/unit_test/ tests/load/ --ignore=objectstore`` (objectstore needs moto extra, pre-existing) → **900 passed / 29 skipped / 0 failed** ✅ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…source_path} to NOT NULL in model to match alembic d0f4c1b9a8e2 post-state CI ``alembic check`` (drift detector) caught a Wave-1-era stale model declaration. The migration ``d0f4c1b9a8e2`` correctly ALTERs both columns to NOT NULL (per architect msg=498b12f0), but ``aperag/indexing/models.py:108-109`` still declared ``Mapped[str | None] ... nullable=True`` from the original Wave 1 fixture-back-compat era. After ``alembic upgrade head`` the DB was NOT NULL but ``Base.metadata`` was nullable, so autogen wanted to emit ``ALTER COLUMN ... DROP NOT NULL`` to revert the DB. The PM directive (``msg=0dd76df9``) read the autogen log "Detected NULL on column" as "DB has NULL" and asked to add the ALTER NOT NULL to the migration; the migration already does that. The actual fix is to align the model with the migration's post-state (NOT NULL), not the other way around — Wave 3 lifted the back-compat the original ``nullable=True`` was protecting. Changes: - aperag/indexing/models.py:108-109: ``Mapped[str | None] ... nullable=True`` → ``Mapped[str] ... nullable=False`` for both columns + comment refresh pointing at the alembic NOT-NULL promotion - tests/unit_test/indexing/test_t2_1_runtime.py: ``test_reconciler_skips_pending_rows_missing_source_path`` deleted — the fixture ``_insert_row(... source_path=None)`` now raises IntegrityError before reconcile_pending_dispatch is ever called, so the scenario is unreachable from a clean schema. The defensive ``if not row.source_path`` branch in ``aperag/indexing/reconciler.py`` is kept as a zero-cost guard but no longer reachable without manual SQL bypass. Gates: - ``uv run alembic -c aperag/alembic.ini check`` → "No new upgrade operations detected" ✅ - pytest tests/unit_test/ tests/load/ --ignore=objectstore → 899 passed / 29 skipped / 0 failed ✅ - ruff check + format --check clean on the 2 modified files ✅ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…adapter + drop celery infra CI e2e-http-provider caught two Wave-3-induced regressions on PR #1729 HEAD `5d50ca5`: **Blocker 1 — rebuild_indexes 500 DATABASE_ERROR**: The chunk-3 ``_create_or_update_document_indexes`` adapter calls ``dispatch_indexing()`` which INSERTs new ``document_index`` rows. ``rebuild_indexes`` re-invokes the adapter with the same ``(document_id, parse_version, modality)`` triple (content unchanged → parse_version unchanged), so the §F.1 ``uq_document_index_triple`` UNIQUE constraint fails the INSERT with IntegrityError → 500. Pre- DELETE matching rows (any status / serving state) before INSERT so the dispatcher's INSERT lands cleanly. The §F.3 cutover-on-sync- completion re-establishes the serving state once the new dispatch's worker finishes; brief unavailability between DELETE and cutover is acceptable for an explicit rebuild op. Test failure traced from `tests/e2e_http/hurl/full/11_document_full. hurl:204` POST `/api/v2/collections/.../documents/.../rebuild_indexes` expecting HTTP 200, getting 500. **Blocker 2 — celerybeat container `celery: not found`**: chunk 2 dropped ``celery`` + ``django-celery-beat`` from ``pyproject.toml`` and deleted ``aperag/tasks/`` + ``config/celery.py``, but the docker-compose ``celeryworker`` / ``celerybeat`` / ``flower`` services + helm chart ``celeryworker-deployment.yaml`` / ``celerybeat-deployment.yaml`` / ``flower-deployment.yaml`` + the ``scripts/start-celery-{worker,beat,flower}.sh`` entry scripts were left behind. CI e2e-aperag spins up the docker-compose stack, the ``celerybeat`` container tries to ``exec celery`` and fails (binary not in image since pyproject dropped the dep). The new in-process ``aperag.indexing`` runtime (worker pool + reconciler + cleanup loops) is spawned by the FastAPI lifespan inside the ``aperag-api`` container, so no separate worker / beat / monitoring pods are needed. DELETED: - docker-compose.yml: ``celeryworker`` / ``celerybeat`` / ``flower`` service blocks (replaced with explanatory comment block) - scripts/start-celery-{worker,beat,flower}.sh - scripts/test/celery-{call-task,with-local-queue}.sh - scripts/celery/trigger_trask.sh + the ``scripts/celery/`` dir - deploy/aperag/templates/celeryworker-deployment.yaml - deploy/aperag/templates/celerybeat-deployment.yaml - deploy/aperag/templates/flower-deployment.yaml - deploy/aperag/values.yaml: ``celery-worker`` + ``celerybeat`` + ``flower`` value blocks (replaced with explanatory comment) - deploy/aperag/templates/aperag-secret.yaml: ``CELERY_FLOWER_*`` env entries (no flower pod to consume them) - deploy/aperag/templates/_helpers.tpl: ``celeryworker.labels`` template (no chart consumes it) - deploy/aperag/values.yaml api podAffinity-with-celery-worker rule (the api pod no longer needs to co-locate with a non-existent worker pod; the soft anti-affinity for spreading api replicas across nodes is preserved) - deploy/aperag/templates/api-deployment.yaml: comment "shared uploaded files between api and celery" → "uploaded files volume consumed solely by the in-process ``aperag.indexing`` runtime" Local gates: - ruff check + format --check on the changed files → clean ✅ - pytest tests/unit_test/indexing/ tests/load/ test_phase3_reexport_audit.py → 133 passed ✅ Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…rkflow + Makefile Wave 3 chunk 2 + 144c3f1 deleted the docker-compose ``celeryworker`` / ``celerybeat`` / ``flower`` services + helm charts, but a few infra-side scripts that explicitly referenced those service names were missed. CI e2e-http-smoke caught it: ``docker compose up -d celeryworker`` failed with ``no such service: celeryworker``. This PR plugs the four straggler call sites: - tests/e2e_http/runners/compose/up.sh:8: ``E2E_COMPOSE_SERVICES`` default drops ``celeryworker celerybeat`` → just ``postgres redis qdrant es api``. The api container's FastAPI lifespan spawns the in-process indexing runtime, so no separate worker container. - tests/e2e_http/scripts/provider_diagnostic.sh:63: failure-diag log-dump loop drops ``celeryworker celerybeat`` from the service list. - .github/workflows/e2e-http-smoke.yml:68,173: ``docker compose logs`` in the failure-dump steps drops ``celeryworker celerybeat``. - Makefile: deleted ``serve-worker`` / ``serve-beat`` / ``serve-flower`` targets + their help-string entries (the binaries are gone since pyproject dropped ``celery``). Local sanity: ``grep -rn 'celery|celerybeat|celeryworker|flower' tests/ e2e_http/ .github/ Makefile docker-compose.yml deploy/`` returns only explanatory comment lines (the in-process runtime replacement narrative); no live service / command references remain. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…with ProductionWorkerFactory + harden orchestrator factory-error path Wave 3 hard-cut deleted the legacy Celery indexers but left the FastAPI lifespan wiring ``run_*_worker`` with a placeholder factory that raised ``NotImplementedError`` on every dispatch. e2e-http- provider stalls on ``wait_for_document_indexes`` because the row never advances past PENDING (PM msg=dc13c4a5 root cause). Per architect msg=7782ebe0 spec lock: - ``aperag/indexing/worker_factory.py`` (new): per-task lazy ``ProductionWorkerFactory`` resolving ``Collection`` from the payload, building the right ``ModalityWorker`` with real Qdrant / Elasticsearch backends + the configured embedder / completion model. Composes existing helpers (``get_collection_embedding_service_sync`` / ``get_vector_db_connector`` / ``get_object_store`` / ``build_collection_llm_callable``) so this is wiring, not re-implementation. Failures raise ``WorkerFactoryError`` so the operator gets a meaningful ``error_message``. Graph modality is intentionally minimal (in-memory lineage store + no-op extractor) pending Wave 4 Nebula-side §D.3.6 lineage adapter — documented as a known gap, not a regression; the e2e-http-provider gate only blocks on vector ACTIVE. - ``aperag/indexing/orchestrator.py``: harden ``_runner`` to claim the row + finalise FAILED on factory error so the §I.2 retry- with-backoff schedule kicks in. Without this, factory errors got silently swallowed by the asyncio.Task and the row sat at PENDING forever. - ``aperag/app.py``: replace the placeholder closure with a ``ProductionWorkerFactory`` instance. - ``tests/integration/test_worker_factory.py``: 3 tests pinning factory-failure → FAILED-finalize, collection-not-found path, and missing-collection-id path. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 909 passed / 41 skipped / 0 failed (+3 from this commit). ruff check + format --check clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…al to §F.2 4-state + drop SKIPPED sentinel + skip vector when disabled Per architect msg post-pass-5 + PM msg=79683cc0 ruling. Two e2e-http- smoke bugs surfaced after the worker_factory wire-in lands: **Bug 1 — Pydantic 400 on GET document.** orchestrator claims a row to ``RUNNING`` (the §F.2 canonical 4-state) before the worker finishes; the ``Document`` view model's per-modality status Literal still listed the legacy 6-state vocabulary (``CREATING``/``DELETING``/``DELETION_IN_PROGRESS``/``SKIPPED``) which never includes ``RUNNING`` — so any GET racing the claim returned ``ValidationError``. The Wave 3 hard-cut migrated the DB enum but missed this view-model layer (CR step-0 lesson #6: schema-touching PR must trace enum references through every deserialise surface, not just the write path). The fix collapses the 5 per-modality status Literals to the §F.2 4-state ``Optional[Literal["PENDING", "RUNNING", "ACTIVE", "FAILED"]]``. "Modality not enabled" is now expressed by the field being absent (``None``) rather than the sentinel ``"SKIPPED"`` — the row simply does not exist in ``document_index``. Friendly client-facing mapping (``NOT_ENABLED`` / ``INDEXING``) lives in §G.5 ``SearchResultMetadata.index_state_per_modality`` for the read-path response. **Bug 2 — collection without embedder triggers FAILED loop.** ``_get_index_types_for_collection`` always added ``Modality.VECTOR`` regardless of the collection's ``enable_vector`` flag. A collection without an embedding-model config (smoke test fixture) then dispatched a vector job, the production worker factory raised ``WorkerFactoryError`` (no embedder), the orchestrator finalised ``FAILED``, the reconciler re-dispatched, repeat. The fix honours ``enable_vector`` symmetric with ``enable_fulltext``: explicitly disabled means no row in the document_index table for that modality. Files: - ``aperag/domains/knowledge_base/schemas.py``: 5 status fields → ``Optional[Literal["PENDING", "RUNNING", "ACTIVE", "FAILED"]]`` - ``aperag/domains/knowledge_base/service/document_service.py``: ``_build_document_response`` returns ``None`` when index row missing (instead of ``"SKIPPED"``); ``_get_index_types_for_collection`` honours ``enable_vector`` flag. - ``tests/e2e_http/hurl/{smoke/03_document_basic,full/11_document_full}.hurl``: 6 assertions migrated from ``== "SKIPPED"`` to ``== null``. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 909 passed / 41 skipped / 0 failed (unchanged from 579b32a). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ne + drop asyncio.to_thread caller wrap Wave 3 chunk 2 Pattern C migration moved 5 ``.delay()`` callsites to ``asyncio.create_task(asyncio.to_thread(run_evaluation_run, run_id))``, but ``run_evaluation_run`` was still a sync wrapper that called ``asyncio.run(execute_evaluation_run(run_id))`` inside the worker thread — spawning a *fresh* event loop each invocation. Any asyncpg connection borrowed by ``execute_evaluation_run`` is bound to the FastAPI lifespan loop's connection pool; running the coroutine on a brand-new loop made every connection-pool ``ping`` fail with ``RuntimeError: got Future attached to a different loop``, which corrupted the asyncpg shared pool. Subsequent DB calls from unrelated code paths (every later e2e-http-provider hurl test that touched Postgres) tripped the same error → CI exit 1 (per huangheng pass-6 followup msg + PM msg=37da5249). Fix per huangheng option (a): * ``aperag/domains/evaluation/tasks.py``: ``run_evaluation_run`` becomes ``async def``, awaits ``execute_evaluation_run`` directly. No fresh loop. Docstring spells out the failure mode so a future reader does not regress. * ``aperag/domains/evaluation/services.py``: caller drops ``asyncio.to_thread`` and schedules the coroutine directly via ``asyncio.create_task(run_evaluation_run(run_id))``. The task shares the FastAPI lifespan loop, keeping asyncpg pool affinity. Pattern C contract preserved (fire-and-forget at the request handler boundary); only the inner mechanism changes from "thread + new loop" to "coroutine on shared loop". The other 4 ``.delay()`` callsites in chunk 2 were genuine sync work and stay on ``asyncio.to_thread`` — only evaluation's body was async-native under the hood, which is why this was the one that blew up. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 909 passed / 41 skipped / 0 failed (unchanged). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…or Pattern C in-process dispatch CI run 24976479158 (PR #1729 head e1f2325) failed at ``16_evaluation_v2.hurl:218`` with the assertion ``$.items[0].status == "queued"``; the actual response showed ``status="running"`` because the post-pass-7 evaluation cross-loop fix (e1f2325) made dispatch effectively immediate — the ``asyncio.create_task(run_evaluation_run(run_id))`` worker starts on the next event-loop tick, so by the time the GET arrives the run has already left "queued". The test was written for Celery ``.delay()`` semantics where "queued" was a stable, externally-observable transient state thanks to broker round-trip + worker pickup latency. Pattern C in-process collapses that latency to microseconds, so "queued" is no longer reliably observable on a follow-up GET. Fix per huangheng option (a) + PM ack: relax 4 timing-sensitive assertions to accept any in-flight or terminal state via ``matches "^(queued|running|completed|cancelled)$"`` (item status uses the correspondingly-broader ``pending|...|failed|cancelled``). The contract this test pins is "the run shows up correctly in list / detail endpoints with the right ids", not "dispatch is slow enough to observe a specific transient state". POST-response asserts (lines 183, 207) keep the strict ``status == "queued"`` value because those are synchronous returns built before the ``create_task`` fires. Also relaxes: - ``summary.pending == 3`` → drop (kept ``summary.total == 3``, which is fixed by dataset cardinality) - ``progress.percent == 0`` → drop (now race-window-dependent) - ``items[0].status == "pending"`` → matches in-flight set - ``items[0].attempt_count == 0`` → drop (worker may have attempted already) - ``attempts body contains "items":[]`` → ``$.items exists`` (envelope shape only, ignore population timing) Local gates: pytest 161 passed (evaluation worker + openapi contract + indexing + integration + load + phase3 audit). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…patch_indexing in document_service PR #1729 head 30b3489 e2e-http-provider failed at the scripted ``run_chat_collection_flow.sh`` business flow because vector + fulltext modality workers reported "found no chunks at user-X/colY/docZ; treating as derive-incomplete and skipping" on every claim — the chunks.jsonl artifact never existed at the dispatcher's ``source_path``. Root cause (architect msg=c605037e ruling): chunk 2's hard-cut deleted ``aperag/domains/indexing/{tasks,orchestration,manager, *_index}.py`` whose former ``process_document_task`` ran :func:`aperag.indexing.parse_document` and wrote the canonical ``derived/parse_<v>/{markdown.md,outline.json,chunks.jsonl}`` artifacts before enqueuing modality workers. The new dispatch path never picked up that responsibility — every modality worker.derive pulled an empty derived path and the row stayed in the §C.7 reschedule loop forever. Fix per architect option (1) — Wave 3 minimal scope, not skip: ``aperag/domains/knowledge_base/service/document_service.py`` ``_create_or_update_document_indexes`` now: 1. Resolves the upload object path from ``document.doc_metadata.object_path`` (the upload handler already stashes it there). 2. Reads the source bytes from the object store on a worker thread. 3. Calls :func:`parse_document` synchronously on a worker thread so the canonical ``derived/parse_<v>/`` artifacts exist before any modality dispatch. 4. Uses ``parsed.parse_version`` and ``parsed.chunks_path`` as the dispatcher's parse_version / source_path (replaces the previously-computed-locally values that pointed at the document base prefix, not the chunks.jsonl file). This keeps §E.2 "parse-as-first-stage" intact; the parse step runs inside the request task instead of a separate ``parse_worker`` queue process. Wave 4 follow-up may promote parse to ``q:parse`` once observed parse latency starts blocking HTTP requests; the sync path is acceptable for current latencies. Parse failure raises and propagates → HTTP 500 → no modality rows created (per architect ruling: "fail loudly, no half-state"). New integration test ``tests/integration/test_dispatch_with_parse.py`` pins the canonical post-fix flow: parse first → dispatch with chunks.jsonl path → modality workers reach ``status=ACTIVE`` AND ``is_serving=TRUE`` (uses ``IndexingMode.INLINE`` so no lifespan / async queue needed; the same data-flow contract). Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 910 passed / 41 skipped / 0 failed (+1 new test). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…d in worker_factory PR #1729 head 8ca396f e2e-http-provider failed with vector worker hitting Qdrant 400 ``Format error in JSON body: value f766a946575ec3b4:0000 is not a valid point ID``. Qdrant only accepts unsigned-integer or UUID point ids; the T1.1 parser produces chunk ids of the form ``<sha-prefix>:<index>`` (16 hex + ``:`` + 4-digit) which violate that constraint. Fulltext reaches ACTIVE because Elasticsearch is happy with any string ``_id``. Vector hits 400 on the very first ``client.upsert(...)`` call. Fix in ``aperag/indexing/worker_factory.py _QdrantPointBackend.upsert_point``: map the caller-supplied chunk_id to a deterministic ``uuid.uuid5(NAMESPACE_OID, chunk_id)`` for the Qdrant point id (stable across retries → idempotent §D.1) and stash the original id in the point payload so the read path can still echo it to clients. Vector / summary / vision share the same ``_QdrantPointBackend`` adapter so this fix covers all three modalities. Local gates: pytest tests/integration/test_worker_factory.py tests/integration/test_dispatch_with_parse.py tests/unit_test/indexing/ tests/load/ = 135 passed. ruff clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…pplement (#1730) Per @earayu2 / @不穷 msg=981ae30d standby task: supplement the existing per-modality acceptance tests in ``test_t1_2_graph.py`` / ``test_t1_3_vector_fulltext.py`` / ``test_t1_4_summary_vision.py`` with **cross-modality contract tests** that exercise invariants the spec promises across all 5 modalities. Per @不穷 msg=0d35f537 scope decision: this lands as a follow-up PR (NOT into the current Wave 3 PR #1729). The branch is based on ``origin/main`` and verified passing against the pre-Wave-3 codebase; no Wave 3 dependency. Coverage added (11 tests): 1. **§C.7 reschedule contract** (2 tests): summary + vision ``derive()`` with a missing upstream source returns ``DeriveResult(derived_artifact_path="")`` (the empty-string signal the orchestrator interprets as "derive incomplete, leave PENDING for next reconciler cycle"). Vector + fulltext are pass-through and covered by existing per-modality "no-op on missing chunks" tests. 2. **N-call replay convergence** (4 tests, parametrized across vector/fulltext/summary/vision): 5 consecutive ``sync()`` calls produce a backend state byte-equivalent to a single sync (extends existing 2-call tests under arbitrary retry storms — §D.4). 3. **Cross-document parse_version isolation** (4 tests, same params): ``sync()`` of doc-A's ``(doc_a, parse_version)`` slot must NOT touch doc-B's backend state. Locks the §D.1 DELETE scope: ``WHERE document_id=A AND parse_version=V`` only. Uses two distinct doc bodies because the parser's content-hashed ``chunk_id`` would otherwise collide cross-doc — fixture limitation, not a production-code invariant violation; called out in the test docstring. 4. **All-5-modality enum discriminator** (1 test): each ``ModalityWorker`` subclass binds the class-level ``modality`` attribute to the matching ``Modality`` enum value (orchestrator route key — a misbind would silently mis-route work). Graph modality is intentionally NOT in the parametric sweeps because ``test_t1_2_graph.py`` already covers the §D.3 lineage semantic exhaustively (D3.6 5-step scenario + Nebula race + byte-equivalent re-sync + tenant_scope_key propagation). The all-5-modality enum test is the only graph touch here. Local gates: - pytest tests/unit_test/indexing/test_modality_worker_contract.py → 11 passed - ruff check + format --check clean Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

…y + parser markdown-only docs Per architect msg=c79e9a3f gap-report ruling: Wave 3 ships production-ready vector + fulltext + summary + vision modalities; graph and the binary-format parser path are explicitly gated until Wave 4. The previous fix-cycle was patching the visible layers (alembic / Celery residue / view model / hurl timing / cross-loop / Qdrant id format) while the real production-readiness gaps (InMemory graph store, no-op extractor, simulator parser) sat under the surface. Without this gate the graph rows would reach ``status=ACTIVE`` with zero entities written — silent broken UX for graph search. Four changes: * ``aperag/indexing/worker_factory.py``: ``_build_graph_worker`` now detects the :class:`InMemoryLineageGraphStore` placeholder and raises :class:`WorkerFactoryError` with an explicit message pointing at ``enable_knowledge_graph=false``. The orchestrator runner already finalises factory errors as ``FAILED`` with the message persisted to ``error_message`` — operators / clients see a clear refusal rather than a misleading ACTIVE-with-empty graph. When Wave 4 swaps in a Nebula adapter, the ``isinstance`` check naturally stops matching and the gate self-disables. * ``aperag/schema/common.py``: ``CollectionConfig.enable_knowledge _graph`` default flipped from ``True`` to ``False`` so new collections do not opt into the gated path by accident. Wave 4 release flips it back. * ``docs/private-deployment.md``: adds a "Wave 3 release scope" section before tier selection, naming both gates explicitly (graph backend + extractor; markdown-only parser) and the Wave 4 backlog that lifts them. * ``tests/e2e_http/scripts/run_full.sh``: skips ``run_graph_index_flow.sh`` until Wave 4 with a comment pointing at the architect ruling and the docs section. The graph e2e flow would otherwise time out on the explicit ``WorkerFactoryError`` finalising the row to FAILED instead of reaching ACTIVE, masking the legitimate scope cut behind a generic CI red. This is the closing commit for PR #1729 per architect's "no more fix-cycle" lock; everything past this point is Wave 4 backlog. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 910 passed / 41 skipped / 0 failed (unchanged from a11df3c). ruff check + format clean. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…chema + gate vision modality to Wave 4 Per architect msg=69df0779 closing ruling on the two production gaps Bryce diagnosed (msg=8953bd05) after the previous closing attempt (4b0eaf3) still tripped run_chat_collection_flow.sh: **Finding 1 — fulltext field-shape mismatch** (real production bug, Wave 3 must fix): ``aperag/domains/retrieval/pipeline.py:_fulltext_search`` queries the legacy ES schema (``content``/``title``/``collection_id`` filter — the shape the now-deleted ``aperag/domains/indexing/fulltext_index.py`` wrote pre Wave 3 hard-cut). The Wave 1 simulator's ``FulltextModality.sync`` wrote ``text`` and no ``collection_id``, so every fulltext search after Wave 3 returned 0 hits silently. The chat-flow business test exited on ``jq -e items.length > 0``. * ``aperag/indexing/fulltext.py``: ``FulltextModality.__init__`` takes an optional ``collection_id`` (kept optional so existing in-memory contract tests continue to work without it). ``sync`` now writes ``content`` (queried by ``_fulltext_search``) and ``collection_id`` (filtered by ``_fulltext_search``) into every chunk record. ``text`` is kept as an alias of ``content`` so existing in-memory backend assertions do not regress. * ``aperag/indexing/worker_factory.py``: ``_build_fulltext_worker`` passes ``collection.id`` so production rows always carry the filter field. * ``tests/integration/test_fulltext_roundtrip_fields.py`` (new): pins the post-fix invariant — every document the ``FulltextModality.sync`` writes carries the field names the retrieval pipeline depends on (``content`` + ``collection_id``). A regression that drops either trips the first assertion. **Finding 2 — vision modality is fake** (silent broken, gate to Wave 4): The previous ``_build_vision_worker`` built ``VisionModality`` with ``embedding_service.embed_query(f"{image_id}|{alt_text}")`` — a text embedding on a string-concat. That produces deterministic per-image vectors with no actual image-content awareness. Same silent-broken pattern as the gated graph modality (architect ruling msg=c79e9a3f); same Wave 4 gate is the correct response. * ``aperag/indexing/worker_factory.py``: ``_build_vision_worker`` now requires ``embedding_service.is_multimodal()`` to be ``True`` and otherwise raises :class:`WorkerFactoryError` with the same shape as the graph gate. ``CollectionConfig.enable_vision`` is already ``False`` by default, so the gate only fires when an operator explicitly opted in. When Wave 4 ships a real multimodal embedding model + the operator configures it, the ``is_multimodal`` check passes and the gate self-disables. * ``docs/private-deployment.md``: "Wave 3 release scope" section adds a vision-modality gate paragraph alongside the existing graph and parser-markdown-only paragraphs. Wave 4 backlog item #9 is the real multimodal vision-LLM wiring. Local gates: pytest tests/unit_test/ tests/integration/ tests/load/ --ignore=tests/unit_test/objectstore = 912 passed / 41 skipped / 0 failed (+2 new roundtrip tests). ruff check + format clean. This is the (real) closing commit for PR #1729 per architect's "no more fix-cycle" lock — every remaining gap is now in the Wave 4 backlog (9 items, hard-locked). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The license-check infrastructure (`check-license`, `install-addlicense`, `install-hooks`, the bundled `addlicense` binary, and the git hooks install) was already removed in commit 00ae644 (PR #1729). The `add-license` target left behind only echoes a one-line message and does nothing else, so it is dropped here along with its `.PHONY` entry. License headers in source files are unchanged. Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>

按 task #17 PR 协作 review 模式，fold-in 冬柏 msg=d56bb0f7 review 反馈： 1. §二 6 hard gate 加具体测试文件 mapping 子表（每个 gate 有可执行 test 文件锚点 + 默认 owner，CR verdict 表 §七要求填具体 test commit / 行号） 2. §五加 5.5 失败注入方法规范（kubectl scale / iptables drop / kubectl delete pod 真路径 vs 禁止 mock 客户端绕过；Lesson 沉淀来源 Wave 3 PR #1729 mock 路径过 + 真路径 fail） 3. §七 verdict 表新增 3 行（多文档并发压测阈值 by Planetegg #22 + 黄章书 / smoke regression diff = 0 对照 6 套 hurl baseline / 失败注入用真路径不允许 mock 引用 §5.5） 4. §八加 checklist 修订记录（追踪后续 fold-in） verdict 中文化（§5.2）已 align 冬柏建议（✅ 同意通过 / ⏳ 阻塞 / 💡 小修建议），无需调整。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@ziang

* docs(task-17): 提交 task #17 代码改造方案章节进仓库 (协作 PR base) 按 earayu2 directive msg=53a0b121 (开 PR) + msg=d18b6887 (所有人可改 PR) + ziang msg=4ea65100/76f6f465 (代码改造文档必须入仓不能引用 agent 本地路径) + 团队 5 方共识 v8 final (候选 B + 一次性 hard cut + 解 503 最小改造). 本提交是协作式 PR 的初始 base, 含 Bryce 写的 §3 file-by-file 代码改造细节 (后 ziang msg=4ea65100 7 项 CR 修正 — 实施时按 ziang 简化版走, 见 commit 后续 push). v8 final 整体方案 (§1/2/4-12) 由架构师 push 进来; 部署 / 发布 / 回滚 (§4-6) 由 huangzhangshu push; 状态机 / 失败场景 (§7) 由 ziang push; Helm + 代码 commits 由 Bryce 按 ziang 7 项简化版逐文件 push. 不进 task #17 主线 (按 Weston msg=d2c46eb7 + ziang msg=cd4761aa 收紧): - 包名 ``aperag/tasks/`` 大搬迁 - task #18-21 (collection 配置校验 / graph_extractor fail-loud / 图谱 GC sweep / store bulk upsert API) 作独立切片候选, 不默认进 task #17 - Dramatiq/Celery/RQ 引入 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): add v8 final architecture spec — task system hard cut 按 5 方独立 evidence-based 共识 + 全部 BLOCKER 收紧 + ziang msg=bb5a9e2e 最后修正（删除外部 agent 路径引用，改成本 PR diff 引用）后的 ratify-ready v8.2 final 版本归档进 docs/zh-CN/architecture/task-system-hard-cut-v8.md。包含： - Executive Summary 推荐候选 B + Reject Cards (A/C/D) - §1-2 上下文 + 不可变 hard gate (黄章书 7 条 + 2 层 invariant) - §3 代码改造 8 项 runtime 主线（不含 module 搬迁、不含 hardening） - §4 部署改造 (huangzhangshu §4) - §5 发布计划 8 步 - §6 回滚策略 + 双执行面 hard gate - §7 状态机/失败场景验收 (ziang §7) - §8 架构评审 8 块 review checklist (Weston) - §9 Spec lock fold (6 YAGNI + 4 escape hatch + PgBouncer 后续选项) - §10 CR mandatory checklist (5 cross-check + Lesson #11/12/extension v3 + Mini-pattern 17) - §11 节奏 + 团队分工 - §12 待 earayu2 ratify Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs: add task 17 deployment release runbook * docs(task-17): add state machine validation gates * docs(task-17): add architecture invariants (task system) 锁定 ApeRAG 异步任务系统不可变 invariants 进 architecture 文档: - 部署架构 invariant (API/worker 进程隔离 + probe 不放大连接池 + 连接池公式 + 回滚执行面唯一性) - 业务状态 invariant (DocumentIndex SoT + cleanup intent SoT 在 DB + API 不拥有重型执行面 + 旧任务防写回) - 任务系统选型 invariant (6 YAGNI + 4 escape hatch + PgBouncer 后续) - CR mandatory checklist (5 cross-check + Lesson #11/12/extension v3 + Mini-pattern 17 + 6 hard gate) 未来 PR review 必引用本文档防 incremental drift。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): add huangheng CR mandatory checklist 按 earayu2 msg=c1c4ba2f directive「每个人补充文档」+ task #17 PR collaboration draft 协作要求，新增 huangheng CR 角色专属 mandatory checklist 文档 docs/zh-CN/architecture/task-17-cr-review-checklist.md。包含： - 5 条 cross-check（粒度等量 / 节间一致 / 数字合理 / framework claim 分级 / 推荐 evidence-grounded） - 6 条架构 hard gate（ziang msg=4ea65100 + msg=ad6a610d 收敛版：API 不启重型执行面 / cleanup intent 真源 DB / object store delete 迁出 / API readiness 轻量 / 连接池 Helm 层映射 / 回滚执行面唯一） - 7 条实现修正（ziang msg=4ea65100 + msg=76f6f465 + Bryce msg=981960cd accept 版：settings 现有 module 实例 / QuotaPolicyRegistry 直接创建 / 连接池 Helm 映射 / 删除 helper 不嵌套 transaction / object store delete 迁出 / cleanup loop 补 deleted Document scan / diagnostics 鉴权 + sync URL） - CR 必应用 lessons（Lesson #11 / Lesson #12 / extension v3 / Mini-pattern 17 / 一次性不分阶段） - CR 工作流（PR 上线后 CR 顺序 + verdict 表述规范 + false positive 自我修正 + 团队协作边界） - CR 历史 sediment 引用 - task #17 PR final review verdict 表（待 PR 合并前填） Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): align scope and architecture doc links * deploy: add indexing worker helm topology * docs: clarify quota scope for task 17 deployment * feat(task-17): hard cut API/worker — 新增 indexing-worker CLI + 删 lifespan worker startup (#20) 按 task #17 v8.2 final + ziang msg=4ea65100 7 项 CR 修正实施: 1. 新增 ``aperag/cli/__init__.py`` + ``aperag/cli/indexing_worker.py`` - ``python -m aperag.cli.indexing_worker`` 独立进程入口, Helm ``indexing-worker-deployment.yaml`` (huangzhangshu commit f4be52f) 调用 - 跟 ``aperag/app.py`` 老 lifespan 行为完全等价: 选 queue (redis/inmemory) / quota backend (RedisQuotaBackend(quota_redis, QuotaPolicyRegistry())) / metrics emitter (otlp/noop) 同款 dispatch - 启动 10 个 asyncio 后台任务: 7 modality worker (vector/fulltext/graph/ graph_facts/graph_vectors/summary/vision) + parse + reconciler + cleanup - SIGTERM/SIGINT graceful shutdown: 等所有 in-flight 任务 drain + 关闭 RedisWorkQueue / quota redis client 2. ``aperag/app.py`` lifespan hard cut - 删除所有 ``asyncio.create_task(run_*_worker)`` (vector/fulltext/graph/ graph_facts/graph_vectors/summary/vision 7 个 modality worker) - 删除 ``run_parse_worker`` / ``run_reconcile_loop`` / ``run_cleanup_loop`` 启动 - 删除 ``ProductionWorkerFactory(engine=engine)`` 构造 (API 不需要消费 task) - 删除 ``indexing_runtime_tasks`` list + ``indexing_shutdown`` event (没有 task 需要 drain) - 保留: queue (push_parse / dispatcher 用) + quota_backend (API 检查租户配额) + metrics_emitter (上报 SLI) + IndexingRuntime singleton (service 层 enqueue) - ``IndexingRuntime.cleanup_worker_factory=None`` (per ziang msg=cecb0d88 hard gate: API 请求路径不能直接调 ``cleanup_for_deleted_documents`` / ``delete_objects_by_prefix``, cleanup 由 worker cleanup loop 异步执行) - 老 ``/health`` endpoint 暂保留 (line 497-501), task #21 (@申栋栋) 实施 ``/health/live`` / ``/health/ready`` / ``/health/diagnostics`` 时改造新加坡 503 根因: API + 重型 indexing worker 共进程, graph 索引压力把 ``/health`` / 事件循环 / 线程池 / DB 连接池一起拖死, kubelet 杀 pod, ALB 503. 本 commit 把 worker 进程从 API lifespan 拆出来 — API pod 只做 HTTP 路由 + 轻量入队, ``/health`` 不再受 worker 资源压力影响. ziang msg=4ea65100 7 项 CR 修正 fold-in: 1. ✅ 用现有 ``settings`` (module-level), 不引 ``get_settings()`` helper 2. ✅ ``ProductionWorkerFactory`` 从 ``aperag.indexing.worker_factory`` import 3. ✅ ``RedisQuotaBackend(quota_redis, QuotaPolicyRegistry())`` / ``InMemoryQuotaBackend(QuotaPolicyRegistry())`` 跟 app.py 现有写法一致 4. ✅ 连接池只在 helm 层映射现有 ``DB_POOL_SIZE``/``DB_MAX_OVERFLOW`` (huangzhangshu commit f4be52f 已 push), 不引应用层双 env alias 5. ⏳ ``_delete_document`` 不嵌套 transaction (留 task #17.4 @明书 cleanup 迁出 pair 处理) 6. ⏳ Object store prefix delete 也迁出 (留 task #17.4 处理) 7. ⏳ ``run_cleanup_loop`` deleted Document scan 补全 (留 task #19 @ziang 处理) 8. ⏳ ``/health/diagnostics`` 鉴权 + sync URL (留 task #21 @申栋栋处理) ⏳ 项不在 task #20 (#17.1+17.2) scope 内 (huangheng msg=f97b7c5f 6 lane 协作分工). 本 commit 严格只覆盖 cli + lifespan 改造, 不越界. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * test(task-17): co-implementer add multi-doc burst e2e + probe-layering smoke Sub-task #22 co-implementer contribution (chenyexuan; Planetegg main owner) covering two non-overlapping slices that the deployment hard cut needs end-to-end coverage for: 1. ``tests/load/test_concurrent_doc_upload_e2e.py`` — multi-document concurrent HTTP upload + index ACTIVE assertion. Skipped by default (``RUN_TASK_17_E2E=1`` to enable). Runs DOC_COUNT documents in parallel through the public API, polls until vector / fulltext / graph indexes all reach ACTIVE within ``POLL_BUDGET_SECONDS``, and samples ``/health/live`` every 2s during the burst — every sample must be 200 with p95 latency under 500ms, which is the cascading- failure-mode that the API/worker hard cut prevents. 2. ``tests/e2e_http/hurl/smoke/00_health.hurl`` — extend existing smoke with the probe-layering pins for sub-task #21 (申栋栋 main owner): ``/health`` (compat), ``/health/live`` (liveness, must not touch PG/Redis/Qdrant), ``/health/ready`` (readiness, must not consume main DB pool). ``/health/diagnostics`` is intentionally NOT asserted in smoke per Planetegg msg=64f33ceb — that endpoint is gated by admin auth / intranet and is covered from Planetegg's release-validation lane. Coordination trail: thread #indexing优化:5e959a2d msg=abccc676 (split proposal) → msg=3a7953a6 (Planetegg confirms split) → msg=64f33ceb (diagnostics nit absorbed). Cross-checks against task #18 (黄章书 deployment) and task #21 (申栋栋 health endpoints) — both still in flight; this commit pins the contract callers will assert against once those land. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * docs(task-17): fold-in 冬柏 msg=d56bb0f7 4 条测试维度补充按 task #17 PR 协作 review 模式，fold-in 冬柏 msg=d56bb0f7 review 反馈： 1. §二 6 hard gate 加具体测试文件 mapping 子表（每个 gate 有可执行 test 文件锚点 + 默认 owner，CR verdict 表 §七要求填具体 test commit / 行号） 2. §五加 5.5 失败注入方法规范（kubectl scale / iptables drop / kubectl delete pod 真路径 vs 禁止 mock 客户端绕过；Lesson 沉淀来源 Wave 3 PR #1729 mock 路径过 + 真路径 fail） 3. §七 verdict 表新增 3 行（多文档并发压测阈值 by Planetegg #22 + 黄章书 / smoke regression diff = 0 对照 6 套 hurl baseline / 失败注入用真路径不允许 mock 引用 §5.5） 4. §八加 checklist 修订记录（追踪后续 fold-in） verdict 中文化（§5.2）已 align 冬柏建议（✅ 同意通过 / ⏳ 阻塞 / 💡 小修建议），无需调整。 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * deploy: pass postgres env to indexing worker * test(task-17): add local stability acceptance helper * feat(task-17): add split health endpoints * refactor(task-17): mount health router under prefix * test(task-17): add hard gate #1 grep gate (API lifespan no workers) The task #17 hard cut moved every ``run_*_worker`` / ``run_*_loop`` entrypoint off ``aperag/app.py`` and onto ``aperag/cli/indexing_worker.py`` to fix the Singapore 503 root cause (API + worker sharing one process, graph indexing pressure starving ``/health``). Pin that contract at the source level so a future PR cannot re-merge the runtimes by accident. The new ``tests/unit_test/test_app_lifespan_no_workers.py`` runs in the standard PR-gate suite (no deployment) and asserts: Negative on ``aperag/app.py``: - No invocation of any of the 10 worker entrypoints (vector / fulltext / graph / graph_facts / graph_vectors / summary / vision / parse / reconciler / cleanup). - No ``ProductionWorkerFactory(...)`` construction. - ``IndexingRuntime.cleanup_worker_factory`` only ever assigned ``None`` (per ziang msg=cecb0d88 + huangheng msg=f97b7c5f #6 hard gate forbidding API request path heavy backend cleanup). Positive on ``aperag/cli/indexing_worker.py``: - All 10 worker entrypoints invoked (the hard cut is symmetrical: what the API stops doing, the worker must start doing — otherwise the deployment boots but consumes nothing). - ``ProductionWorkerFactory(...)`` constructed worker-side. This is the executable companion to huangheng's ``task-17-cr-review-checklist.md`` hard gate #1 (per cuiwenbo msg=f7868d2c hard-gate-to-test-file mapping). Owner nominally @bryce per the checklist; @chenyexuan co-implementer covers this slice (msg=6627eb69) so Bryce can focus on the upcoming task #17.4 ``document_service.delete`` cleanup migration. Verified: 5/5 tests pass against current PR branch head. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(task-17): avoid legacy health redirect * chore(task-17): clean whitespace in docs and health files * docs: document task 17 pool budget defaults * test(task-17): assert legacy health does not redirect * fix(task-17): align health hurl smoke with dongdong's actual response shapes After dongdong's task #21 commits d5a2ff9 + 0c8b03b + b65f470, the real per-endpoint ``status`` values differ from what my initial hurl extension (commit f5ba44d) had pinned: - ``/health`` → ``{"status": "healthy", "service": "aperag-api"}`` - ``/health/live`` → ``{"status": "live", "service": "aperag-api"}`` - ``/health/ready`` → ``{"status": "ready", "service": "aperag-api"}`` The original extension copy-pasted the legacy ``healthy`` value across all three. Match the real shape now that #21 is in review (cuiwenbo ✅ msg=5b7b4892, Weston ✅ msg=f0b020b4) so the smoke does not break the moment the deployment lands. Also pin the ``service`` field on each endpoint to catch any future regression that drops the service-id label. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(task-17): fold worker shutdown drain + service.delete cleanup - indexing_worker: 25s asyncio.wait_for + cancel fallback so a stuck loop cannot hold the kubelet 30s grace into SIGKILL; flush OTLP MeterProvider on exit so PeriodicExportingMetricReader samples are not dropped (matches app.py finally semantics); reword the quota_backend / metrics_emitter retain-comment so the intent is legible (real acquire wires up in task #24). - document_service: drop the orphan _delete_document_indexes helper — its only consumer (_delete_document) now leaves cleanup to the worker cleanup loop, so the helper would otherwise still pull cleanup_for_deleted_documents into the API request path and break the hard gate. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> * fix(task-17): add DB-backed deleted document cleanup scan * test(task-17): add sustained health observation window * test(task-17): wire concurrent upload e2e fixtures * test(task-17): remove obsolete app lifespan worker invariant --------- Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: huangheng <huangheng@aperag.local>

earayu and others added 19 commits April 27, 2026 01:44

earayu mentioned this pull request Apr 27, 2026

test(indexing): cross-modality ModalityWorker.derive+sync contract supplement #1730

Merged

4 tasks

earayu and others added 3 commits April 27, 2026 12:49

earayu and others added 2 commits April 27, 2026 13:41

earayu merged commit 00ae644 into main Apr 27, 2026
4 checks passed

earayu deleted the chenyexuan/celery-wave3-cutover branch April 27, 2026 06:21

This was referenced Apr 27, 2026

feat(celery Wave 4): real backends + 11 production-readiness items #1731

Merged

chore(makefile): remove residual add-license no-op target #1788

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(celery Wave 3): T3.1 hard-cut legacy Celery indexing layer#1729

feat(celery Wave 3): T3.1 hard-cut legacy Celery indexing layer#1729
earayu merged 24 commits into
mainfrom
chenyexuan/celery-wave3-cutover

earayu commented Apr 27, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

earayu commented Apr 27, 2026

Summary

Per-commit highlights

Architectural changes

Files deleted

Final-HEAD gates

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant